Normal Distribution
Normal Distribution: Definition and Properties (Symmetry, Bell Shape)
Definition
The **Normal Distribution**, often referred to as the **Gaussian Distribution** (named after Carl Friedrich Gauss) or colloquially as the "bell curve," is one of the most important and widely used continuous probability distributions. Its importance stems from its ability to model, at least approximately, many naturally occurring phenomena and from its central role in statistical theory, particularly through the Central Limit Theorem.
A **continuous random variable** $X$ is said to follow a normal distribution if its probability density function has a specific symmetric, bell-shaped form (the normal PDF formula). Not every bell-shaped curve is normal, but every normal curve has this shape, and its position and width are entirely determined by two parameters:
- **Mean ($\mu$):** This parameter represents the **center** of the distribution. It is the location of the peak of the bell curve and is also equal to the median and mode of the distribution.
- **Standard Deviation ($\sigma$):** This parameter measures the **spread** or variability of the distribution. It determines how wide or narrow the bell curve is. A larger value of $\sigma$ indicates greater variability and results in a shorter, wider curve. A smaller value of $\sigma$ indicates less variability and results in a taller, narrower curve. The standard deviation must be positive ($\sigma > 0$).
A normal distribution with mean $\mu$ and variance $\sigma^2$ is denoted by the notation $X \sim N(\mu, \sigma^2)$. Note that some texts or software might use the standard deviation in the notation, e.g., $N(\mu, \sigma)$, so it's important to clarify whether the second parameter is variance or standard deviation.
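As a rough illustration of how $\sigma$ controls the height and width of the curve, and of the notation caveat above, the following sketch uses Python's standard-library `statistics.NormalDist`. Its second argument is the standard deviation, not the variance. The parameter values are assumptions chosen only for this example.

```python
from statistics import NormalDist

# Hypothetical parameters, chosen only for illustration.
narrow = NormalDist(mu=0.0, sigma=1.0)  # sigma is the standard deviation
wide = NormalDist(mu=0.0, sigma=2.0)

# The peak height is the density at the mean; it halves when sigma doubles.
peak_narrow = narrow.pdf(narrow.mean)
peak_wide = wide.pdf(wide.mean)
print(round(peak_narrow, 4), round(peak_wide, 4))  # 0.3989 0.1995

# The constructor takes sigma, so the variance is sigma squared.
print(wide.stdev, wide.variance)  # 2.0 4.0
```

In the $N(\mu, \sigma^2)$ notation used here, the second distribution would be written $N(0, 4)$, even though the code passes `sigma=2.0`.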
Properties of the Normal Distribution
The normal distribution has several key properties that contribute to its importance and ease of use:
- **Bell Shape:** The graph of the normal distribution's probability density function is a distinctive, symmetric, bell-shaped curve.
- **Symmetry:** The normal curve is perfectly symmetric about the vertical line $x = \mu$, so the distribution is not skewed. Because of this symmetry, the mean, median, and mode of a normal distribution are all located at the same point:
Mean = Median = Mode = $\mu$
... (i)
- **Unimodal:** The distribution has a single peak, located at $x = \mu$.
- **Asymptotic Tails:** The tails of the normal curve extend indefinitely in both directions, approaching the horizontal axis asymptotically. The curve gets arbitrarily close to the axis but never touches it, so technically any real number is a possible value for a normally distributed variable, although values far from the mean have extremely low probability.
- **Total Area Under the Curve:** As with any continuous probability distribution, the total area under the probability density curve and above the horizontal axis is equal to 1. This represents the total probability of all possible outcomes, which must equal 1 (or 100%):
$$\int_{-\infty}^{\infty} f(x) dx = 1$$
... (ii)
(where $f(x)$ is the probability density function of the normal distribution).
- **Empirical Rule (68-95-99.7 Rule):** A very useful property for interpreting normal distributions is the empirical rule, which gives the approximate percentage of data that falls within a given number of standard deviations of the mean:
  - Approximately **68%** of the data falls within **one** standard deviation of the mean ($\mu \pm \sigma$).
  - Approximately **95%** of the data falls within **two** standard deviations of the mean ($\mu \pm 2\sigma$).
  - Approximately **99.7%** of the data falls within **three** standard deviations of the mean ($\mu \pm 3\sigma$).

  This rule provides a quick way to understand the spread of data in a normal distribution.
These properties make the normal distribution mathematically tractable and applicable to a wide range of statistical problems.
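The percentages in the empirical rule can be checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library. The helper name `within` is hypothetical and introduced only for this illustration.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

def within(k: float) -> float:
    """Probability of falling within k standard deviations of the mean."""
    return z.cdf(k) - z.cdf(-k)

for k in (1, 2, 3):
    print(k, round(within(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```

Because the rule is stated in units of standard deviations, the same fractions apply to any normal distribution, whatever its $\mu$ and $\sigma$.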
Probability Density Function of Normal Distribution (Implicit)
Probability for Continuous Random Variables
For a **continuous random variable** $X$, it is not meaningful to assign a positive probability to $X$ taking a single specific value: because the possible values form a continuum, the probability of any one exact value is zero. Instead, probability for a continuous random variable is defined over intervals.
The distribution of a continuous random variable is described by a **Probability Density Function (PDF)**, typically denoted by $f(x)$. The PDF does not give probabilities directly, but its value at any given point indicates the relative likelihood of the variable taking a value around that point.
Key characteristics of a Probability Density Function $f(x)$ for a continuous random variable:
- The function values are non-negative: $f(x) \ge 0$ for all $x$.
- The probability that the random variable $X$ falls within a specific interval $[a, b]$ is given by the **area under the PDF curve** between $a$ and $b$. This is calculated using integration:
$$P(a \le X \le b) = \int_{a}^{b} f(x) dx$$
... (1)
- The total area under the entire PDF curve over all possible values of $X$ must be equal to 1, representing the total probability:
$$\int_{-\infty}^{\infty} f(x) dx = 1$$
... (2)
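Properties (1) and (2) can be illustrated with a simple numerical integration. The sketch below approximates the area under a density with the midpoint rule. The helper `area_under`, the step count, and the choice of a standard normal density are assumptions made for the example. The finite range $[-10, 10]$ stands in for $(-\infty, \infty)$.

```python
from statistics import NormalDist

def area_under(pdf, a: float, b: float, steps: int = 10_000) -> float:
    """Approximate the integral of pdf over [a, b] with the midpoint rule."""
    width = (b - a) / steps
    return sum(pdf(a + (i + 0.5) * width) for i in range(steps)) * width

dist = NormalDist(mu=0.0, sigma=1.0)  # illustrative choice of density

p_interval = area_under(dist.pdf, -1.0, 2.0)  # P(-1 <= X <= 2) as an area
p_total = area_under(dist.pdf, -10.0, 10.0)   # approximates the total area

# Compare the numerical area with the library's cumulative probabilities.
print(round(p_interval, 4), round(dist.cdf(2.0) - dist.cdf(-1.0), 4))
print(round(p_total, 6))
```

The two numbers on the first line should agree, and the total area should come out very close to 1.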
The Normal PDF Formula
The normal distribution $N(\mu, \sigma^2)$ is defined by a specific mathematical formula for its Probability Density Function, $f(x)$. This formula dictates the precise shape of the normal curve (bell shape, symmetry, etc.) for any given values of the mean ($\mu$) and standard deviation ($\sigma$).
The formula for the PDF of a normal distribution is:
$$f(x \, | \, \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{1}{2} \left(\frac{x - \mu}{\sigma}\right)^2}$$
... (3)
for $-\infty < x < \infty$, $\mu \in \mathbb{R}$, and $\sigma > 0$.
Where:
- $x$: The value of the random variable.
- $\mu$: The mean of the distribution.
- $\sigma$: The standard deviation of the distribution.
- $e$: The base of the natural logarithm ($e \approx 2.71828$).
- $\pi$: The mathematical constant Pi ($\pi \approx 3.14159$).
Probabilities are obtained by integrating this function, but the integral has no closed form in terms of elementary functions, so in practice it is evaluated numerically with statistical software or read from tables. Even so, the formula confirms that the distribution is completely determined by $\mu$ and $\sigma$: the shape is fixed and is only shifted by $\mu$ and scaled by $\sigma$. The factor $\frac{1}{\sigma \sqrt{2\pi}}$ is a normalizing constant that ensures the total area under the curve is 1.
In most introductory statistics applications, probabilities for a normal distribution are found using the Standard Normal Distribution (Z-scores) and Z-tables rather than direct integration of this formula.
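To make formula (3) concrete, here is a minimal sketch that evaluates it term by term and compares the result with the standard library's `NormalDist.pdf`. The function name `normal_pdf` and the values $\mu = 10$, $\sigma = 2$ are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist

def normal_pdf(x: float, mu: float, sigma: float) -> float:
    """Evaluate the normal density directly from formula (3)."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

mu, sigma = 10.0, 2.0  # illustrative values
reference = NormalDist(mu, sigma)
for x in (8.0, 10.0, 12.0):
    # Hand-written formula next to the library value for the same x.
    print(x, round(normal_pdf(x, mu, sigma), 5), round(reference.pdf(x), 5))
```

The densities at 8 and 12 are equal, reflecting the symmetry about $\mu$, and the largest value occurs at $x = \mu$.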
Standard Normal Distribution and Z-scores
The Standard Normal Distribution ($Z$)
Calculating probabilities for every possible normal distribution $N(\mu, \sigma^2)$ by integrating its probability density function (PDF) formula directly would be impractical. To simplify probability calculations for normal distributions, we use a single, standardized normal distribution as a universal reference. This is called the **Standard Normal Distribution**.
Definition: The Standard Normal Distribution is a special case of the normal distribution where the **mean ($\mu$) is 0** and the **standard deviation ($\sigma$) is 1**. Consequently, its variance ($\sigma^2$) is also 1.
A random variable that follows the standard normal distribution is conventionally denoted by the letter $Z$. Thus, $Z \sim N(\mu=0, \sigma^2=1)$.
The properties of the standard normal distribution are the same as any normal distribution (bell shape, symmetry, unimodal, asymptotic tails, total area = 1), but its centering at 0 and spread of 1 make it convenient for standardization.
Standardization and Z-scores
The process of converting any value from an arbitrary normal distribution $X \sim N(\mu, \sigma^2)$ into a corresponding value on the standard normal distribution $Z \sim N(0, 1)$ is called **standardization**. The transformed value is known as a **Z-score** or a standard score.
The formula for standardizing an observation $x$ from a normal distribution with mean $\mu$ and standard deviation $\sigma$ is:
$$Z = \frac{x - \mu}{\sigma}$$
... (1)
Interpretation of a Z-score:
A Z-score tells you exactly how many standard deviations an original observation $x$ is away from the mean $\mu$ of its distribution. The sign of the Z-score indicates whether the observation is above or below the mean:
- If $Z$ is positive, the original value $x$ is above the mean $\mu$.
- If $Z$ is negative, the original value $x$ is below the mean $\mu$.
- If $Z = 0$, the original value $x$ is exactly equal to the mean $\mu$.
For example, a Z-score of $Z=2$ means the value $x$ is 2 standard deviations above the mean. A Z-score of $Z=-0.5$ means the value $x$ is half a standard deviation below the mean.
Purpose of Standardization:
The primary reason for standardization is to be able to use a single table (the Standard Normal or Z-table) or standard statistical functions in calculators/software to find probabilities for ANY normal distribution. Once an $X$ value is converted to its corresponding $Z$-score, the probability associated with that $X$ value (e.g., the probability of getting a value less than $x$) is the same as the probability associated with the $Z$-score in the standard normal distribution.
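The claim that a probability for $X$ equals the corresponding probability for $Z$ can be checked directly. The sketch below assumes a hypothetical distribution with $\mu = 100$, $\sigma = 15$ and an observation $x = 130$. None of these numbers come from the text.

```python
from statistics import NormalDist

mu, sigma = 100.0, 15.0  # hypothetical distribution parameters
x = 130.0                # hypothetical observation

z = (x - mu) / sigma                     # standardize the observation
p_from_x = NormalDist(mu, sigma).cdf(x)  # P(X < x) on the original scale
p_from_z = NormalDist().cdf(z)           # P(Z < z) on the standard scale

print(z, round(p_from_x, 4), round(p_from_z, 4))  # 2.0 0.9772 0.9772
```

Both routes give the same probability, which is why a single standard normal table is enough.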
Example
Example 1. Scores on a statistics test are normally distributed with a mean of 70 and a standard deviation of 8. Find the Z-scores for students who scored:
(a) 78
(b) 62
(c) 70
Answer:
Given: Normal distribution with mean $\mu = 70$ and standard deviation $\sigma = 8$.
To Find: Z-scores for given scores.
Solution:
We use the standardization formula $Z = \frac{X - \mu}{\sigma}$ (Formula 1) with $\mu=70$ and $\sigma=8$.
**(a) Score $X = 78$:**
$$Z = \frac{78 - 70}{8}$$
... (ii)
$$Z = \frac{8}{8} = 1$$
... (iii)
The Z-score for a score of 78 is 1. This means a score of 78 is exactly 1 standard deviation above the mean.
**(b) Score $X = 62$:**
$$Z = \frac{62 - 70}{8}$$
... (iv)
$$Z = \frac{-8}{8} = -1$$
... (v)
The Z-score for a score of 62 is -1. This means a score of 62 is exactly 1 standard deviation below the mean.
**(c) Score $X = 70$:**
$$Z = \frac{70 - 70}{8}$$
... (vi)
$$Z = \frac{0}{8} = 0$$
... (vii)
The Z-score for a score of 70 is 0. This means a score of 70 is exactly at the mean.
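The same three calculations can be reproduced in a few lines of Python. The function name `z_score` is a hypothetical helper. The parameters are those given in the example.

```python
def z_score(x: float, mu: float, sigma: float) -> float:
    """Number of standard deviations x lies above (+) or below (-) mu."""
    return (x - mu) / sigma

mu, sigma = 70, 8  # parameters given in the example
for score in (78, 62, 70):
    print(score, z_score(score, mu, sigma))
# 78 1.0
# 62 -1.0
# 70 0.0
```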
Area Under the Normal Curve and its Interpretation (using Z-tables)
Area Represents Probability
For any continuous random variable, the probability that the variable falls within a specific range of values is represented by the **area under the probability density function (PDF) curve** over that range. For a normally distributed variable $X \sim N(\mu, \sigma^2)$:
The probability that $X$ is between values $a$ and $b$, denoted $P(a \le X \le b)$, is equal to the area under the normal curve between $x=a$ and $x=b$.
Since the normal distribution is continuous, the probability of $X$ being exactly equal to any single value is zero ($P(X=x) = 0$). Therefore, for any constants $a$ and $b$, $P(a \le X \le b) = P(a < X \le b) = P(a \le X < b) = P(a < X < b)$. The inclusion or exclusion of the endpoints does not affect the probability (area).
Using the Standard Normal (Z) Distribution for Probabilities
Because any normal distribution can be transformed into the standard normal distribution using $Z = (X - \mu) / \sigma$, we can calculate probabilities for any $X \sim N(\mu, \sigma^2)$ by finding the corresponding area under the standard normal curve $Z \sim N(0, 1)$.
The probability statement involving $X$ can be converted into an equivalent probability statement involving $Z$. For example, to find $P(a \le X \le b)$, we standardize the values $a$ and $b$ to obtain Z-scores $z_1 = (a-\mu)/\sigma$ and $z_2 = (b-\mu)/\sigma$. Then, the probability $P(a \le X \le b)$ is equal to the area under the standard normal curve between $z_1$ and $z_2$, i.e., $P(z_1 \le Z \le z_2)$.
$$P(a \le X \le b) = P\left(\frac{a-\mu}{\sigma} \le Z \le \frac{b-\mu}{\sigma}\right) = P(z_1 \le Z \le z_2)$$
... (iii)
Standard Normal (Z) Tables
Standard Normal tables, commonly known as **Z-tables**, provide pre-calculated areas under the standard normal curve $Z \sim N(0, 1)$ for various Z-scores. The type of table used is important:
- A common type of Z-table gives the **cumulative probability** $P(Z \le z)$, which is the area under the standard normal curve to the **left** of a given Z-score $z$.
- Other tables might give the area between 0 and $z$, or the area in the tails. Always check the diagram provided with the table to understand what area it represents.
Assuming we use a Z-table that provides $P(Z \le z)$ (area to the left):
- To find $P(Z \le z)$: Look up the Z-score $z$ in the table and read the corresponding probability.
- To find $P(Z \ge z)$: This is the area to the right of $z$. Since the total area under the curve is 1, $P(Z \ge z) = 1 - P(Z < z)$. Because $Z$ is continuous, $P(Z < z) = P(Z \le z)$. So, $P(Z \ge z) = 1 - P(Z \le z)$. Look up $P(Z \le z)$ and subtract from 1.
- To find $P(a \le Z \le b)$: This is the area between $a$ and $b$. It is the cumulative area up to $b$ minus the cumulative area up to $a$. $P(a \le Z \le b) = P(Z \le b) - P(Z \le a)$. Look up both $P(Z \le b)$ and $P(Z \le a)$ in the table and subtract.
- To find $P(Z \le -z)$ for $z > 0$: Use symmetry. The area to the left of $-z$ is equal to the area to the right of $z$. $P(Z \le -z) = P(Z \ge z) = 1 - P(Z \le z)$.
Modern statistical calculators and software can compute normal probabilities directly given $\mu$, $\sigma$, and the interval, without requiring manual Z-score conversion or table lookups.
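The table rules above translate directly into code. The sketch below computes the cumulative probability $P(Z \le z)$ from the error function `math.erf`, using the standard identity $P(Z \le z) = \tfrac{1}{2}\left(1 + \operatorname{erf}(z/\sqrt{2})\right)$. The helper names `phi`, `right_tail`, and `between`, and the Z-values used, are assumptions for illustration.

```python
import math

def phi(z: float) -> float:
    """Cumulative probability P(Z <= z) for the standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def right_tail(z: float) -> float:
    """P(Z >= z) = 1 - P(Z <= z)."""
    return 1.0 - phi(z)

def between(a: float, b: float) -> float:
    """P(a <= Z <= b) = P(Z <= b) - P(Z <= a)."""
    return phi(b) - phi(a)

print(round(phi(1.5), 4))            # area to the left of 1.5
print(round(right_tail(1.5), 4))     # area to the right of 1.5
print(round(between(-0.5, 1.5), 4))  # area between -0.5 and 1.5
print(math.isclose(phi(-1.5), right_tail(1.5)))  # symmetry rule
```

Each function mirrors one bullet above, so the output can be checked line by line against a left-area Z-table.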
Example
Example 1. Using the test score data from Example 1, Section I3 ($X \sim N(\mu=70, \sigma=8)$), find the probability that a randomly selected student scored:
(a) Less than 78
(b) More than 62
(c) Between 62 and 78
Answer:
Given: Test scores are normally distributed with mean $\mu=70$ and standard deviation $\sigma=8$.
To Find: Probabilities for specific score ranges.
Solution:
We convert the given X scores to Z-scores using the formula $Z = (X - \mu) / \sigma$. From Example 1, Section I3:
- A score of $X=78$ corresponds to $Z = \frac{78 - 70}{8} = \frac{8}{8} = 1$.
- A score of $X=62$ corresponds to $Z = \frac{62 - 70}{8} = \frac{-8}{8} = -1$.
- A score of $X=70$ corresponds to $Z = \frac{70 - 70}{8} = \frac{0}{8} = 0$.
**(a) Probability of scoring less than 78:** $P(X < 78)$.
Convert the inequality to Z-scores:
$$P(X < 78) = P\left(Z < \frac{78 - 70}{8}\right) = P(Z < 1)$$
... (iv)
Using a standard normal (Z) table that gives area to the left (or a calculator function):
$$P(Z < 1) = P(Z \le 1) \approx 0.8413$$
(From Z-table for Z=1.00) ... (v)
The probability of scoring less than 78 is approximately 0.8413.
**(b) Probability of scoring more than 62:** $P(X > 62)$.
Convert the inequality to Z-scores:
$$P(X > 62) = P\left(Z > \frac{62 - 70}{8}\right) = P(Z > -1)$$
... (vi)
Using the complement rule $P(Z > -1) = 1 - P(Z \le -1)$. Look up $Z=-1.00$ in the table:
$$P(Z \le -1) \approx 0.1587$$
(From Z-table for Z=-1.00) ... (vii)
$$P(X > 62) = 1 - P(Z \le -1) \approx 1 - 0.1587 = 0.8413$$
... (viii)
The probability of scoring more than 62 is approximately 0.8413. Note the symmetry: for any distribution that is symmetric about $\mu$, $P(X > \mu - c) = P(X < \mu + c)$.
**(c) Probability of scoring between 62 and 78:** $P(62 \le X \le 78)$.
Convert the interval to Z-scores:
$$P(62 \le X \le 78) = P\left(\frac{62 - 70}{8} \le Z \le \frac{78 - 70}{8}\right) = P(-1 \le Z \le 1)$$
... (ix)
Using the property for area between two Z-scores: $P(-1 \le Z \le 1) = P(Z \le 1) - P(Z \le -1)$.
We already found $P(Z \le 1) \approx 0.8413$ (from v) and $P(Z \le -1) \approx 0.1587$ (from vii).
$$P(62 \le X \le 78) \approx 0.8413 - 0.1587 = 0.6826$$
... (x)
The probability of scoring between 62 and 78 is approximately 0.6826.
This result is consistent with the Empirical Rule, which states that approximately 68% of data in a normal distribution falls within one standard deviation of the mean (here $\mu \pm \sigma = 70 \pm 8$, i.e., the interval $[62, 78]$).
Summary of results:
(a) $P(X < 78) \approx 0.8413$
(b) $P(X > 62) \approx 0.8413$
(c) $P(62 \le X \le 78) \approx 0.6826$
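As a cross-check on the table-based answers, the same probabilities can be computed without manual standardization. This is a minimal sketch using `statistics.NormalDist` from the Python standard library. The variable names are chosen only for this example.

```python
from statistics import NormalDist

scores = NormalDist(mu=70, sigma=8)  # sigma is the standard deviation

p_a = scores.cdf(78)                   # P(X < 78)
p_b = 1 - scores.cdf(62)               # P(X > 62)
p_c = scores.cdf(78) - scores.cdf(62)  # P(62 <= X <= 78)

print(round(p_a, 4), round(p_b, 4), round(p_c, 4))  # 0.8413 0.8413 0.6827
```

The value for (c) differs from the table result 0.6826 in the last digit only because the table entries 0.8413 and 0.1587 are themselves rounded before subtracting.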